'무료 런치'의 종말
수십 년 동안 개발자들은 "순차적 한계"라는 시대를 누렸다. 이 시대에는 데너드 스케일링 모든 새로운 칩 세대가 더 빠른 클럭 속도를 보장했다. 하지만 우리는 이제 파워 월에 도달했으며, 성능은 더 이상 주파수의 함수가 아니다. 그것은 병행성의 함수가 되었다. 앞으로 나아가기 위해서는 계산 사고력 을 활용해 추상적인 수치적 방법 과 현대적인 병렬 실행 모델사이의 격차를 메워야 한다.
정밀도-성능 갈등
분자 역학과 같은 도메인 문제 예를 들어) 다중 코어 호스트 에서 CUDA 장치 로 옮기는 것은 문법 변경을 넘어서는 것이다. 그것은 문제 분해 방식의 변화를 의미한다. 병렬화할 때 자주 연산 순서를 바꾸게 된다. 부동소수점 연산은 결합 법칙이 성립하지 않기 때문에, 우리는 다음과 같은 상충 관계를 마주하게 된다: 부동소수점 정밀도 대 정확도병렬 결과는 수학적으로는 타당할지라도, 순차적 계산 결과와 수치적으로는 다를 수 있다.
main.py
TERMINALbash — 80x24
> Ready. Click "Run" to execute.
>
QUESTION 1
What is the primary reason the 'Sequential Ceiling' was reached?
The end of Moore's Law entirely.
Thermal limits and the Power Wall hindering frequency scaling.
Lack of developer interest in C++.
The transition to quantum computing.
✅ Correct!
Dennard scaling failed because we could no longer reduce voltage as transistors got smaller, leading to unsustainable heat at high frequencies.❌ Incorrect
Transistor counts still grow, but we can't run them faster sequentially due to the 'Power Wall'.QUESTION 2
According to Amdahl's Law, if 5% of a program is strictly sequential, what is the maximum theoretical speedup?
Infinite speedup.
Approximately 20x.
5x.
100x.
✅ Correct!
1 / (0.05 + 0) = 20. The 5% sequential fraction limits the speedup regardless of parallel resources.❌ Incorrect
The sequential fraction (s) dictates the ceiling: 1/s.QUESTION 3
Why might a parallel Molecular Dynamics simulation yield slightly different results than a sequential one?
The CPU uses 64-bit while the GPU only uses 8-bit.
Floating-point addition is non-associative in parallel execution.
Parallel threads randomly skip calculations.
The CUDA compiler ignores numerical methods.
✅ Correct!
In parallel, operations are often reordered. (A+B)+C != A+(B+C) in floating-point due to rounding error accumulation.❌ Incorrect
Floating-point non-associativity is the culprit; the order of summation matters for precision.QUESTION 4
What does 'Problem Decomposition' involve in the context of parallel programming?
Breaking code into functions for readability.
Mapping domain-specific data to parallel execution models like threads or grids.
Deleting unnecessary variables to save memory.
Compiling the code for multiple OS targets.
✅ Correct!
Decomposition involves identifying how data (data-level) or functions (task-level) can be computed independently.❌ Incorrect
It refers to the structural strategy used to unlock concurrency.QUESTION 5
Which of the following describes the 'Computational Thinking' bridge?
A hardware component between the CPU and GPU.
A framework to translate domain knowledge into architecture-aware algorithms.
An automated AI tool that writes CUDA kernels.
The process of upgrading RAM on a host machine.
✅ Correct!
It is the mental framework that synthesizes domain problems with hardware constraints and programming models.❌ Incorrect
It is a strategic methodology, not a physical component.Case Study: Scaling a Fluid Dynamics Solver
Numerical Consistency in Parallel Architectures
A research team is porting a Navier-Stokes fluid solver from a single-core CPU to a cluster of CUDA devices. While the GPU version is 50x faster, the pressure values at the 100th iteration differ from the sequential version by 0.001%.
Q
1. Is this divergence likely caused by a hardware bug or a fundamental property of parallel computing?
Solution:
It is a fundamental property. In parallel reductions, the order of summation for grid cells changes. Due to floating-point non-associativity, these small rounding differences accumulate over time.
It is a fundamental property. In parallel reductions, the order of summation for grid cells changes. Due to floating-point non-associativity, these small rounding differences accumulate over time.
Q
2. How does the 'Sequential Ceiling' explain why the team chose CUDA over a faster single-core CPU?
Solution:
Dennard Scaling has ended; single-core frequency no longer increases significantly. Massive parallelism on CUDA devices is the only way to achieve the required throughput, even if it requires managing numerical accuracy trade-offs.
Dennard Scaling has ended; single-core frequency no longer increases significantly. Massive parallelism on CUDA devices is the only way to achieve the required throughput, even if it requires managing numerical accuracy trade-offs.